A Statistical and Rule-based Method for Chunking Verbal Units in Thai Texts
Abstract
Tokenizing a text into a sequence of words is an important step towards text interpretation, and it is required in many applications such as text summarization, semantic search, and machine translation. Instead of splitting text into words, recent work has explored chunking it into units larger than words. Text chunking divides a running text into non-overlapping groups of words that carry meaningful content, such as named entities and verbal units. In this work, we explore three layers of verbal units: (1) verbal sequences, (2) verb phrases (i.e., verbal chunks, causative forms, and event occurrences), and (3) elementary discourse units (EDUs). As the basic layer, a verbal sequence is defined as a single verb or a sequence of contiguous verbs without any interrupting nouns or particles. For example, ...
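The definition of a verbal sequence can be illustrated with a minimal sketch. It is not the authors' statistical and rule-based method; it only shows the grouping rule stated in the abstract, assuming the text has already been segmented into words and part-of-speech tagged. The function name `verbal_sequences`, the tag labels (`VERB`, `NOUN`, `PART`), and the placeholder tokens are hypothetical and chosen purely for illustration.

```python
from typing import FrozenSet, List, Tuple


def verbal_sequences(
    tagged: List[Tuple[str, str]],
    verb_tags: FrozenSet[str] = frozenset({"VERB"}),  # hypothetical tag label
) -> List[List[str]]:
    """Group contiguous verb-tagged words into verbal sequences.

    Any word whose tag is not in `verb_tags` (for example a noun or a
    particle) closes the sequence that is currently being built.
    """
    sequences: List[List[str]] = []
    current: List[str] = []
    for word, tag in tagged:
        if tag in verb_tags:
            current.append(word)
        elif current:
            # A non-verb interrupts the run, so the sequence ends here.
            sequences.append(current)
            current = []
    if current:
        # Flush a sequence that runs to the end of the input.
        sequences.append(current)
    return sequences


# Placeholder tokens stand in for real words; the tags are assumed input.
example = [
    ("n1", "NOUN"),
    ("v1", "VERB"),
    ("v2", "VERB"),
    ("n2", "NOUN"),
    ("v3", "VERB"),
    ("p1", "PART"),
]
print(verbal_sequences(example))
```

Here `v1` and `v2` form one verbal sequence because no noun or particle separates them, while `v3` forms a sequence on its own. The higher layers named in the abstract (verb phrases and EDUs) would need information beyond this simple adjacency rule.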
Similar resources
Chunking Using Conditional Random Fields in Korean Texts
We present a method of chunking in Korean texts using conditional random fields (CRFs), a recently introduced probabilistic model for labeling and segmenting sequence data. In agglutinative languages such as Korean and Japanese, a rule-based chunking method is predominantly used for its simplicity and efficiency. A hybrid of a rule-based and machine learning method was also proposed to handl...
Determining the Boundary and Type of Syntactic Phrases in Persian Texts
Text tokenization is the process of splitting text into meaningful tokens such as words, phrases, and sentences. Tokenization into syntactic phrases, known as chunking, is an important preprocessing step needed in many applications such as machine translation, information retrieval, and text to speech. In this paper chunking of Farsi texts is done using statistical and learning methods and the grammat...
Japanese Unknown Word Identification by Character-based Chunking
We introduce a character-based chunking for unknown word identification in Japanese text. A major advantage of our method is an ability to detect low frequency unknown words of unrestricted character type patterns. The method is built upon SVM-based chunking, by use of character n-gram and surrounding context of n-best word segmentation candidates from statistical morphological analysis as feat...
Japanese Named Entity Extraction with Redundant Morphological Analysis
Named Entity (NE) extraction is an important subtask of document processing such as information extraction and question answering. A typical method used for NE extraction of Japanese texts is a cascade of morphological analysis, POS tagging and chunking. However, there are some cases where segmentation granularity contradicts the results of morphological analysis and the building units of NEs, ...
A Punjabi Grammar Checker
This article describes grammar checking software developed to detect grammatical errors in Punjabi texts and to suggest corrections wherever appropriate. This system utilizes a full-form lexicon for morphology analysis and rule-based systems for part of speech tagging and phrase chunking. The system supported by a set of carefully devised erro...
Publication date: 2013